Supplementary Material To: Classification of P-glycoprotein-interacting Compounds Using Machine Learning Methods

P-glycoprotein (Pgp) is a drug transporter that plays important roles in multidrug resistance and drug pharmacokinetics. The inhibition of Pgp has become a notable strategy for combating multidrug-resistant cancers and improving therapeutic outcomes. However, the polyspecific nature of Pgp, together with inconsistent results in experimental assays, renders the determination of endpoints for Pgp-interacting compounds a great challenge. In this study, the classification of a large set of 2,477 Pgp-interacting compounds (i.e., 1341 inhibitors, 913 non-inhibitors, 197 substrates and 26 non-substrates) was performed using several machine learning methods (i.e., decision tree induction, artificial neural network modelling and support vector machine) as a function of their physicochemical properties. The models provided good predictive performance, producing MCC values in the range of 0.739-1 for internal cross-validation and 0.665-1 for external validation. The study provided simple and interpretable models for important properties that influence the activity of Pgp-interacting compounds, which are potentially beneficial for screening and rational design of Pgp inhibitors that are of clinical importance.


Coping with imbalanced data sets
The fuzzy C-means clustering (FCM) algorithm divides the input data into several clusters in which every data point possesses a partial membership in multiple clusters rather than complete association with a single cluster. Therefore, data points near the center of a cluster have a greater degree of belonging than data points located at the edge of the cluster. Initially, FCM divides the N vectors $x_i$ ($i = 1, 2, 3, \ldots, N$) into $c$ fuzzy groups. The clustering center of each group is subsequently calculated, and the non-similarity index value function is minimized. To determine membership in a cluster, each data point is given a value between 0 and 1; accordingly, the elements of the membership matrix take values between 0 and 1. The optimal cluster of each $x_i$ was obtained by minimizing the objective function of FCM (Zhou et al., 2010):

$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{c} u_{ij}^{m} \, \lVert x_i - c_j \rVert^{2} \qquad (1)$$

where $m$ is a real-valued number greater than 1, $u_{ij}$ is the degree of membership of $x_i$ in cluster $j$, $c_j$ is the center of cluster $j$, and $\lVert \cdot \rVert$ is the similarity function between any measured data point and the center.
In this study, the number of clusters was generated from the ratio of the number of samples in the positive class to the number of samples in the negative class. The ratio of inhibitors:non-inhibitors was approximately 1.47; therefore, 2 clusters were generated for the inhibitors data set. Each of these clusters was then used to represent the inhibitors class, along with all 913 non-inhibitors, for construction of the classification models. Herein, the decision tree algorithm was used for empirical observation. The predictive performance of each model was compared using a set of statistical parameters, including % accuracy (Acc), % sensitivity (Sens), % specificity (Spec) and Matthews correlation coefficient (MCC).
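The FCM procedure described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this study: the function names, the random initialisation, the stopping tolerance and the choice of m = 2 are all assumptions made for illustration, and rounding the class ratio up to obtain the number of clusters is likewise an assumed reading of the text.

```python
import math
import random


def n_clusters_from_ratio(n_positive, n_negative):
    """Number of clusters from the class ratio (rounding up is an assumption)."""
    return math.ceil(n_positive / n_negative)


def _dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def fuzzy_c_means(points, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Minimal fuzzy C-means. Returns (centers, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(n_iter):
        # Cluster centers: means of the points weighted by u_ij ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / total
                            for d in range(dim)])
        # Memberships: recomputed from the distances to every center.
        shift = 0.0
        for i in range(n):
            d = [max(_dist(points[i], centers[j]), 1e-12) for j in range(c)]
            new_row = [1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0))
                                 for k in range(c))
                       for j in range(c)]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if shift < tol:  # memberships have stopped changing
            break
    return centers, u
```

Each row of the returned membership matrix sums to 1, so a compound can be assigned to the cluster in which its membership is largest.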
Finally, inhibitors cluster 2 was selected as the best representative of the inhibitors for further CSPR analysis.
a 10-fold cross-validation was performed for internal validation of the models.
b The cluster that gives the best MCC was selected as the representative for further CSPR analysis.
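The four statistical parameters used to compare the models (Acc, Sens, Spec and MCC) follow from the confusion matrix of a binary classifier. The sketch below uses their standard definitions; the function name and the counts in the usage note are hypothetical and are not results of this study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return % accuracy, % sensitivity, % specificity and MCC.

    tp, tn, fp, fn are the counts of true positives, true negatives,
    false positives and false negatives, respectively.
    """
    total = tp + tn + fp + fn
    acc = 100.0 * (tp + tn) / total if total else 0.0
    sens = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    spec = 100.0 * tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sens": sens, "Spec": spec, "MCC": mcc}
```

For example, `classification_metrics(90, 80, 20, 10)` describes a hypothetical classifier that recovers 90 of 100 positives and 80 of 100 negatives. Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it a suitable criterion when the classes are imbalanced.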

Multivariate analysis
Decision tree
Decision tree analysis is a supervised machine-learning algorithm (Tarca et al., 2007; Witten et al., 2011) that has been widely used as a simple interpretation of binary classification (Tarca et al., 2007).
Decision tree analysis is a way to represent a series of rules that leads to a particular classification (Sharma & Jain, 2013). It divides input data into ranges based on attribute values that it learned from the training data set (Patil & Sherekar, 2013). In this study, decision tree models were constructed using the J48 algorithm of the Weka software package version 3.7.11 (Witten et al., 2011). The process of J48 starts with creating if-then rules from the whole training set to split the data into two subsets, in which each subset contains data with the same feature value (Che et al., 2011). The splitting is performed through the use of internal nodes (i.e., independent variables) and external nodes (i.e., dependent variables) connected by branches (i.e., the cutoff value determining the class of the compounds) (Nantasenamat et al., 2013a). The tree initially finds and selects the most informative attribute (i.e., a descriptor as a root node for splitting data), followed by subsequent important attributes as internal nodes, until the terminal branch is reached (Nantasenamat et al., 2013a). The process continues until all samples in a subset are of the same class (Che et al., 2011). Initially, a large tree is grown and then pruned to reduce overfitting (Che et al., 2011). To produce a simple, interpretable tree with the best performance, the minimum number of instances per leaf (miniNumObj) was optimized. The models were empirically constructed using varied miniNumObj values. In addition, the validation set was used for an empirical search of suitable parameters. The results of classification using varied miniNumObj values for the inhibitors/non-inhibitors classifier are shown in Table S2. The best predictive performance of the inhibitors/non-inhibitors model was provided by a miniNumObj of 8. On the basis of this predictive performance, the models built with this miniNumObj value were selected as final.
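How a tree selects an informative cutoff on a single descriptor, and how a minimum number of instances per leaf restricts the candidate splits, can be sketched as follows. The sketch scores binary splits by the information gain ratio, the criterion of C4.5 (of which J48 is the Weka implementation). It is a simplified illustration, not the J48 code: the function names and the `min_obj` argument (standing in for miniNumObj) are assumptions, and pruning is omitted.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def best_threshold(values, labels, min_obj=2):
    """Best cutoff on one numeric descriptor.

    Returns (threshold, gain_ratio), or None when no split leaves at
    least min_obj instances on both sides.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated by a cutoff
        if i < min_obj or n - i < min_obj:
            continue  # one side would hold fewer than min_obj instances
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        split_info = entropy(["L"] * i + ["R"] * (n - i))
        ratio = gain / split_info
        if best is None or ratio > best[1]:
            best = ((lo + hi) / 2.0, ratio)
    return best
```

Raising `min_obj` removes splits that would isolate only a few compounds, which is why a larger value tends to give a smaller and more easily interpreted tree.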

Artificial neural network (ANN)
ANN is a supervised learning algorithm that mimics the behavior of the human brain, where the neuronal nodes of ANN represent human neurons and the synaptic weights represent dendrites and axons (Nantasenamat et al., 2013b). ANN comprises many artificial processing units located in 3 layers, namely the input, hidden and output layers (Sutariya et al., 2013). The strength of the connection between processing units is defined by synaptic weights that can be adjusted by the learning process (Sutariya et al., 2013). The values of independent variables are relayed to the input layer, and then the signals are sent to the hidden and output layers via synaptic weights (Nantasenamat et al., 2013b). The artificial neurons in the hidden layers contain a sigmoidal transfer function $S(x)$ (Eq. 2), which computes the signal passed to the output layer and limits it to the range between 0 and 1 (Nantasenamat et al., 2013b).

$$S(x) = \frac{1}{1 + e^{-\beta x}} \qquad (2)$$

where $\beta$ is the slope parameter. The output layer contains a numerical class that is an unthresholded linear unit. Mathematically, a neuron k may be described as follows:

$$\hat{y} = w_0 + \sum_{j} w_j x_j \qquad (3)$$

where $x_j$ is the j-th input to the neuron, $w_j$ is the corresponding weight of the neuron, which is optimized by minimizing the total squared error, $w_0$ is the weight that corresponds to the bias input, and $\hat{y}$ is the output signal of the neuron.
The model is trained by back-propagation, in which the difference between $\hat{y}$ and $y$ is calculated as the error and propagated from the output layer through the hidden layer to the input layer, followed by a readjustment of the synaptic weights (Nantasenamat et al., 2013b). This process continues until the assigned learning period is reached and a minimized error and good prediction are obtained (Sutariya et al., 2013).
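The forward pass and one weight readjustment can be sketched for a network with a single hidden layer of sigmoidal units and one unthresholded linear output unit. This is an illustrative sketch rather than the Weka implementation: the function names, the use of a single training sample, the learning rate and the omission of momentum are all assumptions.

```python
import math


def sigmoid(x, beta=1.0):
    """Sigmoidal transfer function with slope parameter beta."""
    return 1.0 / (1.0 + math.exp(-beta * x))


def forward(x, hidden_w, hidden_b, out_w, out_b, beta=1.0):
    """Return (hidden activations, output of the linear output unit)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b, beta)
              for ws, b in zip(hidden_w, hidden_b)]
    y_hat = out_b + sum(w * h for w, h in zip(out_w, hidden))
    return hidden, y_hat


def backprop_step(x, y, hidden_w, hidden_b, out_w, out_b, lr=0.1, beta=1.0):
    """One gradient step on the squared error 0.5 * (y_hat - y) ** 2."""
    hidden, y_hat = forward(x, hidden_w, hidden_b, out_w, out_b, beta)
    err = y_hat - y
    # Error signal of each hidden unit, propagated back through out_w;
    # beta * h * (1 - h) is the derivative of the sigmoid.
    deltas = [err * w * beta * h * (1.0 - h) for w, h in zip(out_w, hidden)]
    new_out_w = [w - lr * err * h for w, h in zip(out_w, hidden)]
    new_out_b = out_b - lr * err
    new_hidden_w = [[w - lr * d * xi for w, xi in zip(ws, x)]
                    for ws, d in zip(hidden_w, deltas)]
    new_hidden_b = [b - lr * d for b, d in zip(hidden_b, deltas)]
    return new_hidden_w, new_hidden_b, new_out_w, new_out_b
```

Calling `backprop_step` repeatedly on the training samples corresponds to the learning period described above, with the prediction error shrinking as the weights are readjusted.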
The initial synaptic weights are randomly assigned at the beginning of the learning process, which may give rise to slightly varied predictions. Therefore, ten rounds of calculations were performed, and the average parameter value was calculated and used for construction of the ANN models.
The optimal values of the ANN parameters, including the number of hidden nodes, training time, learning rate and momentum, were empirically searched using AutoWeka, a software program developed in-house. Ten calculations were performed, and the average RMSE values from these rounds were used to measure the predictive performance (Nantasenamat et al., 2013b). The optimal parameters are shown in Table S3.

Support vector machine (SVM)
SVM is a supervised machine-learning algorithm based on statistical learning theory (Vapnik, 1998; Vapnik, 2000). SVM is a classifier that can separate data from two classes by finding a unique separating hyperplane with maximum margin (Vapnik, 2000) to minimize the classification error (Zhou et al., 2010).
SVM searches for the set of training points that are the most difficult to classify, which are defined as support vectors (Vapnik, 2000). These support vectors are closest to the hyperplane and located on the margin boundaries between the two classes (Yang, 2004). These striking characteristics contribute to the robustness and generalization ability of this classifier (Yang, 2004). Non-linear SVM was used in this study. Initially, the non-linear original input X is projected into a higher dimensional feature space to allow the non-linear original data to be linearly separated in the transformed space using the kernel function (Eq. 4) (Tarca et al., 2007).
$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) \qquad (4)$$

where $K(\cdot)$ represents the kernel function and $\phi$ is a mapping function from the original input space into the feature space. Consequently, the linear classification model was constructed in the higher dimensional feature space, where the similarity between input data and support vectors was quantified by the kernel function, i.e., the radial basis function (RBF), as shown in Eq. 5.

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right) \qquad (5)$$

where $K(\cdot)$ represents the kernel function and $\gamma > 0$ determines how the samples are transformed into a high-dimensional search space. As a result, data in the higher dimensional space were linearly separated into two classes by the maximal-margin hyperplane, which provides the maximum distance between the two classes by being located centrally between the boundaries of each data class. The margin is defined as the distance between the two marginal boundaries ($2/\lVert w \rVert$, where $w$ is the weight vector normal to the hyperplane) on which the support vectors are located.
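The RBF kernel and the margin width can be written out directly. The sketch below uses the standard textbook expressions, exp(-gamma * ||x - z||^2) for the kernel and 2/||w|| for the margin of a canonical hyperplane. The function names are assumptions, and the code is an illustration rather than part of the Weka SMO implementation.

```python
import math


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def margin_width(w):
    """Distance between the two margin boundaries: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(v * v for v in w))
```

The kernel equals 1 for identical points and decays towards 0 as the points move apart. A larger gamma makes this decay faster, so each training sample influences a smaller neighbourhood of the descriptor space.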
The most critical step in constructing a well-generalized SVM classifier is to find the optimal parameters of the kernel function. In the case of non-linear SVM, the parameters that should be considered include the complexity parameter (C), which searches for a balance between misclassification and simplicity, gamma ($\gamma$), which determines the extent to which one training sample affects the model, and epsilon ($\varepsilon$), which designates the exponent value (for the linear kernel, this value is 1).
To find the optimal parameters, a two-level grid search was performed using AutoWeka, a software program developed in-house. Initially, the global search was conducted by the systematic adjustment of the exponent values $n$ in the form of $2^n$ for the C and $\gamma$ parameters using a step size of 2. To obtain good performance, a more refined local grid search was performed on the regions identified by the global search using a step size of 0.25. The RMSE value was used to measure the predictive performance. Finally, SVM models were constructed by John Platt's Sequential Minimal Optimization (SMO) algorithm of the Weka software package version 3.7.11 (Witten et al., 2011) using the optimal parameters obtained from the local grid search. The optimal parameters for SVM analysis are shown in Table S4.
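The two-level search can be sketched as a coarse grid over the exponents of 2 (step 2) followed by a fine grid (step 0.25) around the best coarse point. This is not the AutoWeka code: the function names, the exponent ranges and the width of the local window are assumptions, and `score` stands for any user-supplied function that returns the RMSE of a model built with a given C and gamma.

```python
import math


def frange(start, stop, step):
    """Inclusive range of floats from start to stop in increments of step."""
    n = int(round((stop - start) / step))
    return [start + i * step for i in range(n + 1)]


def grid_search(score, c_exps, g_exps):
    """Return (best score, C exponent, gamma exponent) over a grid of 2**n."""
    best = None
    for ce in c_exps:
        for ge in g_exps:
            s = score(2.0 ** ce, 2.0 ** ge)
            if best is None or s < best[0]:
                best = (s, ce, ge)
    return best


def two_level_search(score, c_range=(-5, 15), g_range=(-15, 3),
                     coarse=2.0, fine=0.25):
    """Global search (step `coarse`), then local search (step `fine`)."""
    # The default exponent ranges are assumed; they are not given in the text.
    _, ce, ge = grid_search(score,
                            frange(c_range[0], c_range[1], coarse),
                            frange(g_range[0], g_range[1], coarse))
    # Local window: one coarse step on either side of the global optimum.
    s, ce, ge = grid_search(score,
                            frange(ce - coarse, ce + coarse, fine),
                            frange(ge - coarse, ge + coarse, fine))
    return {"C": 2.0 ** ce, "gamma": 2.0 ** ge, "rmse": s}


def toy_rmse(c, gamma):
    """Hypothetical error surface with its minimum at C=2**3.25, gamma=2**-4.75."""
    return math.sqrt((math.log2(c) - 3.25) ** 2 + (math.log2(gamma) + 4.75) ** 2)
```

Searching on the exponent rather than on the raw parameter covers several orders of magnitude with few evaluations. With the assumed defaults, the coarse grid evaluates 110 points and the fine grid 289, far fewer than a single grid at step 0.25 over the full ranges.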

Fig. S1. A schematic workflow of the data set preparation.
Matthews correlation coefficient (MCC) values for the training set (MCC tr), 10-fold cross-validation (MCC cv) and a validation set (MCC v) were used to determine the predictive performance of the decision tree model. Finally, the miniNumObj that gave the best MCC value was further used for the construction of the CSPR model based on J48.

Table S2. The parameter optimization of the decision tree models.
a Training.
b 10-fold cross-validation.
c External test set.
d The best predictive performance.

Table S3. Optimal parameters of the ANN.

Table S4. Optimal parameters for the SVM models. Parameters of the local search were used for the construction of the SVM models.
a Default ε value of 0.001 was used.